This paper focuses on obtaining clustering information about a
distribution from its i.i.d. samples. We develop theoretical results to
understand and use clustering information contained in the eigenvectors
of data adjacency matrices based on a radial kernel function
with a sufficiently fast tail decay. In particular, we provide population
analyses to gain insights into which eigenvectors should be used and
when the clustering information for the distribution can be recovered
from the sample. We learn that a fixed number of top eigenvectors
might at the same time contain redundant clustering information and
miss relevant clustering information. We use this insight to design the
data spectroscopic clustering (DaSpec) algorithm that utilizes properly
selected eigenvectors to determine the number of clusters automatically
and to group the data accordingly. Our findings extend
the intuitions underlying existing spectral techniques such as spectral
clustering and Kernel Principal Components Analysis, and provide
new understanding into their usability and modes of failure. Simulation
studies and experiments on real-world data are conducted to
show the potential of our algorithm. In particular, DaSpec is found
to handle unbalanced groups and recover clusters of different shapes
better than the competing methods.
1. Introduction. Data clustering based on eigenvectors of a proximity or
affinity matrix (or its normalized versions) has become popular in machine
learning, computer vision and many other areas. Given data
$x_1, \ldots, x_n \in \mathbb{R}^d$,
this family of algorithms constructs an affinity matrix
$(K_n)_{ij} = K(x_i, x_j)/n$ based on a kernel function, such as a
Gaussian kernel $K(x, y) = e^{-\|x - y\|^2/(2\omega^2)}$.
Clustering information is obtained by taking eigenvectors and eigenvalues of
the matrix $K_n$ or the closely related graph Laplacian matrix
$L_n = D_n - K_n$, where $D_n$ is a diagonal matrix with
$(D_n)_{ii} = \sum_j (K_n)_{ij}$. The basic intuition
is that when the data come from several clusters, distances between
clusters are typically far larger than the distances within the same
cluster, and thus $K_n$ and $L_n$ are (close to) block-diagonal matrices
up to a permutation of the points. Eigenvectors of such block-diagonal
matrices keep the same structure. For example, the few top eigenvectors
of $L_n$ can be shown to be constant on each cluster, assuming infinite
separation between clusters, allowing one to distinguish the clusters by
looking for data points corresponding to the same or similar values of
the eigenvectors.
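
The block-diagonal intuition above can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `gaussian_affinity` and `graph_laplacian`, the toy data and the bandwidth value are all assumptions made for illustration. It builds $K_n$ and $L_n = D_n - K_n$ for two well-separated groups and inspects the cross-cluster block of $K_n$ and the bottom spectrum of $L_n$.

```python
import numpy as np

def gaussian_affinity(X, omega):
    """(K_n)_ij = exp(-||x_i - x_j||^2 / (2 omega^2)) / n."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * omega ** 2)) / X.shape[0]

def graph_laplacian(K):
    """L_n = D_n - K_n with (D_n)_ii = sum_j (K_n)_ij."""
    return np.diag(K.sum(axis=1)) - K

# Toy data (an assumption for illustration): two tight, distant groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

K = gaussian_affinity(X, omega=1.0)
L = graph_laplacian(K)

# Largest affinity between points of different groups (off-diagonal block).
cross_block_max = K[:50, 50:].max()

# Eigenvalues of L in ascending order; columns of evecs are eigenvectors.
evals, evecs = np.linalg.eigh(L)

# Spread of the bottom eigenvector within the first group.
spread_in_group = evecs[:50, 0].max() - evecs[:50, 0].min()
```

With this much separation the cross-cluster block is numerically negligible, $L_n$ has two eigenvalues that are zero up to rounding, and the corresponding eigenvectors are constant within each group, which is the structure the clustering methods discussed here rely on.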

In particular, we note the algorithm of Scott and Longuet-Higgins [13]
who proposed to embed data into the space spanned by the top eigenvectors
of $K_n$, normalize the data in that space and group data by investigating
the block structure of the inner product matrix of the normalized data.
Perona and Freeman [10] suggested clustering the data into two groups by
directly thresholding the top eigenvector of $K_n$.

Another important algorithm, the normalized cut, was proposed by Shi
and Malik [14] in the context of image segmentation. It separates data into
two groups by thresholding the second smallest generalized eigenvector of
$L_n$. Assuming $k$ groups, Malik et al. [6] and Ng, Jordan and Weiss [8]
suggested embedding the data into the span of the bottom $k$ eigenvectors
of the normalized graph Laplacian $I_n - D_n^{-1/2} K_n D_n^{-1/2}$ and
applying the $k$-means algorithm to group the data in the embedding space.
For further discussions
on spectral clustering, we refer the reader to Weiss [20], Dhillon, Guan and 
Kulis [2] and von Luxburg [18]. An empirical comparison of various meth- 
ods is provided in Verma and Meila [17]. A discussion of some limitations 
of spectral clustering can be found in Nadler and Galun [7]. A theoretical 
analysis of statistical consistency of different types of spectral clustering is 
provided in von Luxburg, Belkin and Bousquet [19]. 

Similarly to spectral clustering methods, Kernel Principal Component
Analysis (Schölkopf, Smola and Müller [12]) and spectral dimensionality
reduction (e.g., Belkin and Niyogi [1]) seek lower dimensional
representations of the data by embedding them into the space spanned by the
top eigenvectors of $K_n$ or the bottom eigenvectors of the normalized graph
Laplacian with the expectation that this embedding keeps nonlinear structure
of the data. Empirical observations have also been made that KPCA can
sometimes capture clusters in the data. The concept of using eigenvectors of
the kernel matrix is also closely connected to other kernel methods in the
machine learning literature, notably Support Vector Machines (cf. Vapnik
[16] and Schölkopf and Smola [11]), which can be viewed as fitting a linear
classifier in the eigenspace of $K_n$.

Although empirical results and theoretical studies both suggest that the 
top eigenvectors contain clustering information, the effectiveness of these 
algorithms hinges heavily on the choice of the kernel and its parameters, the 
number of the top eigenvectors used, and the number of groups employed. 
As far as we know, there are no explicit theoretical results or practical 
guidelines on how to make these choices. Instead of tackling these questions
for particular data sets, it may be more fruitful to investigate
them from a population point of view. Williams and Seeger [21] investigated
the dependence of the spectrum of $K_n$ on the data density function and
analyzed this dependence in the context of lower-rank matrix approximations
to the kernel matrix. To the best of our knowledge, this work was the first
theoretical study of this dependence.

In this paper we aim to understand spectral clustering methods based on 
a population analysis. We concentrate on exploring the connections between 
the distribution P and the eigenvalues and eigenfunctions of the distribution- 
dependent convolution operator, 

(1.1)    $\mathcal{K}_P f(x) = \int K(x, y) f(y)\, dP(y)$.

The kernels we consider will be positive (semi-)definite radial kernels. Such
kernels can be written as $K(x, y) = k(\|x - y\|)$, where
$k: [0, \infty) \to [0, \infty)$ is a
decreasing function. We will use kernels with sufficiently fast tail decay, such
as the Gaussian kernel or the exponential kernel
$K(x, y) = e^{-\|x - y\|/\omega}$. The
connections found allow us to gain some insights into when and why these 
algorithms are expected to work well. In particular, we learn that a fixed 
number of top eigenvectors of the kernel matrix do not always contain all of 
the clustering information. In fact, when the clusters are not balanced and/or 
have different shapes, the top eigenvectors may be inadequate and redundant 
at the same time. That is, some of the top eigenvectors may correspond to 
the same cluster while missing other significant clusters. Consequently, we 
devise a clustering algorithm that selects only those eigenvectors which have 
clustering information not represented by the other eigenvectors already 
selected. 

The rest of the paper is organized as follows. In Section 2, we cover the
basic definitions, notation and mathematical facts about the
distribution-dependent convolution operator and its spectrum. We point out
the strong connection between $\mathcal{K}_P$ and its empirical version, the
kernel matrix $K_n$, which allows us to approximate the spectrum of
$\mathcal{K}_P$ given data.
In Section 3, we characterize the dependence of eigenfunctions of
$\mathcal{K}_P$ on both the distribution $P$ and the kernel function
$K(\cdot,\cdot)$. We show that the eigenfunctions of $\mathcal{K}_P$ decay
to zero at the tails of the distribution $P$ and that their decay rates
depend on both the tail decay rate of $P$ and that of the kernel
$K(\cdot,\cdot)$. For distributions with only one high-density component, we
provide theoretical analysis. A discussion of three special cases can be found
in Appendix A. In the first two examples, the exact form of the
eigenfunctions of $\mathcal{K}_P$ can be found; in the third, the
distribution is concentrated on or around a curve in $\mathbb{R}^d$.

Further, we consider the case when the distribution P contains several 
separate high-density components. Through classical results of
perturbation theory, we show that the top eigenfunctions of $\mathcal{K}_P$
are approximated by the top eigenfunctions of the corresponding operators
defined on some of those components. However, not every component will
contribute to the top few eigenfunctions of $\mathcal{K}_P$ as the
eigenvalues are determined by the size and
configuration of the corresponding component. Based on this key property,
we show that the top eigenvectors of the kernel matrix may or may not 
preserve all clustering information, which explains some empirical observa- 
tions of certain spectral clustering methods. A real-world high-dimensional 
dataset, the USPS postal code digit data, is also analyzed to illustrate this 
property. 

In Section 4, we utilize our theoretical results to construct the data spec- 
troscopic clustering (DaSpec) algorithm that estimates the number of groups 
data-dependently, assigns labels to each observation, and provides a classifi- 
cation rule for unobserved data, all based on the same eigen decomposition. 
Data-dependent choices of algorithm parameters are also discussed. In
Section 5, the proposed DaSpec algorithm is tested on two simulations against
commonly used $k$-means and spectral clustering algorithms. In both
situations, the DaSpec algorithm provides favorable results even when the
other two algorithms are provided with the number of groups in advance.
Section 6 contains conclusions and discussion.

2. Notation and mathematical preliminaries. 

2.1. Distribution-dependent convolution operator. Given a probability
distribution $P$ on $\mathbb{R}^d$, we define $L^2_P(\mathbb{R}^d)$ to be
the space of square integrable functions, $f \in L^2_P(\mathbb{R}^d)$ if
$\int f^2\, dP < \infty$, and the space is equipped with an
inner product $\langle f, g \rangle = \int f g\, dP$. Given a kernel
(symmetric function of two variables)
$K(x, y): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, (1.1) defines
the corresponding integral operator $\mathcal{K}_P$. Recall that an
eigenfunction $\phi: \mathbb{R}^d \to \mathbb{R}$ and the corresponding
eigenvalue $\lambda$ of $\mathcal{K}_P$ are defined by the following equation:

(2.1)    $\mathcal{K}_P \phi = \lambda \phi$,

and the constraint $\int \phi^2\, dP = 1$. If the kernel satisfies the condition

(2.2)    $\iint K^2(x, y)\, dP(x)\, dP(y) < \infty$,

the corresponding operator $\mathcal{K}_P$ is a trace class operator, which,
in turn, implies that it is compact and has a discrete spectrum.

In this paper, we will only consider the case when a positive semi-definite
kernel $K(x, y)$ and a distribution $P$ generate a trace class operator
$\mathcal{K}_P$, so that it has only countable nonnegative eigenvalues
$\lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge 0$.
Moreover, there is a corresponding orthonormal basis in $L^2_P$ of
eigenfunctions $\phi_i$ satisfying (2.1). The dependence of the eigenvalues
and eigenfunctions of $\mathcal{K}_P$ on $P$ will be one of the main foci of
our paper. We note that an eigenfunction $\phi$ is uniquely defined not only
on the support of $P$, but on every point $x \in \mathbb{R}^d$ through
$\phi(x) = \frac{1}{\lambda} \int K(x, y) \phi(y)\, dP(y)$, assuming that the
kernel function $K$ is defined everywhere on
$\mathbb{R}^d \times \mathbb{R}^d$.

2.2. Kernel matrix. Let $x_1, \ldots, x_n$ be an i.i.d. sample drawn from
distribution $P$. The corresponding empirical operator $\mathcal{K}_{P_n}$
is defined as

    $\mathcal{K}_{P_n} f(x) = \int K(x, y) f(y)\, dP_n(y) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i) f(x_i)$.

This operator is closely related to the $n \times n$ kernel matrix $K_n$, where

    $(K_n)_{ij} = K(x_i, x_j)/n$.

Specifically, the eigenvalues of $\mathcal{K}_{P_n}$ are the same as those of
$K_n$ and an eigenfunction $\phi$, with an eigenvalue $\lambda \ne 0$ of
$\mathcal{K}_{P_n}$, is connected with the corresponding eigenvector
$v = [v_1, v_2, \ldots, v_n]'$ of $K_n$ by

    $\phi(x) = \frac{1}{\lambda n} \sum_{i=1}^{n} K(x, x_i) v_i \quad \forall x \in \mathbb{R}^d$.

It is easy to verify that $\mathcal{K}_{P_n} \phi = \lambda \phi$. Thus values
of $\phi$ at locations $x_1, \ldots, x_n$ coincide with the corresponding
entries of the eigenvector $v$. However, unlike $v$, $\phi$ is defined
everywhere in $\mathbb{R}^d$. For the spectrum of $\mathcal{K}_{P_n}$ and
$K_n$, the only difference is that the spectrum of $\mathcal{K}_{P_n}$
contains 0 with infinite multiplicity. The corresponding eigenspace includes
all functions vanishing on the sample points.

It is well known that, under mild conditions and when $d$ is fixed, the
eigenvectors and eigenvalues of $K_n$ converge to eigenfunctions and
eigenvalues of $\mathcal{K}_P$ as $n \to \infty$ (e.g., Koltchinskii and
Giné [4]). Therefore, we expect the properties of the top eigenfunctions and
eigenvalues of $\mathcal{K}_P$ to also hold for $K_n$, assuming that $n$ is
reasonably large.
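
The relation between an eigenvector of the kernel matrix and the eigenfunction of the empirical operator can be illustrated with a short sketch. The helper names `gaussian_kernel` and `extend_eigenvector`, the sample and the bandwidth are assumptions made for this example; only the extension formula itself comes from the text above.

```python
import numpy as np

def gaussian_kernel(A, B, omega):
    """Pairwise Gaussian kernel values K(a, b) for rows a of A and b of B."""
    A = np.atleast_2d(A)
    B = np.atleast_2d(B)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * omega ** 2))

def extend_eigenvector(x_new, X, v, lam, omega):
    """phi(x) = 1 / (lam * n) * sum_i K(x, x_i) v_i, for each row x of x_new."""
    n = X.shape[0]
    return gaussian_kernel(x_new, X, omega) @ v / (lam * n)

# Illustrative sample and bandwidth (assumptions, not values from the paper).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
omega = 0.5

K_n = gaussian_kernel(X, X, omega) / X.shape[0]
lams, V = np.linalg.eigh(K_n)          # eigenvalues in ascending order
lam0, v0 = lams[-1], V[:, -1]          # top eigenpair

# Evaluated at the sample points, the eigenfunction reproduces the eigenvector.
phi_at_samples = extend_eigenvector(X, X, v0, lam0, omega)

# Evaluated far away from the data, the eigenfunction is essentially zero.
phi_far = extend_eigenvector(np.array([[50.0, 50.0]]), X, v0, lam0, omega)
```

Because $K_n v = \lambda v$, the extension agrees with $v$ on the sample up to rounding, while off-sample points are handled by the same formula.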
3. Spectral properties of $\mathcal{K}_P$. In this section, we study the
spectral properties of $\mathcal{K}_P$ and their connection to the data
generating distribution $P$. We start with several basic properties of the
top spectrum of $\mathcal{K}_P$ and then investigate the case when the
distribution $P$ is a mixture of several high-density components.

3.1. Basic spectral properties of $\mathcal{K}_P$. Through Theorem 1 and its
corollary, we obtain an important property of the eigenfunctions of
$\mathcal{K}_P$, that is, these eigenfunctions decay fast away from the bulk
of the mass of the distribution if the tails of $K$ and $P$ decay fast. A
second theorem offers the important property that the top eigenfunction has
no sign change and multiplicity one. (Three detailed examples are provided in
Appendix A to illustrate these two important properties.)

Theorem 1 (Tail decay property of eigenfunctions). An eigenfunction
$\phi$ with the corresponding eigenvalue $\lambda > 0$ of $\mathcal{K}_P$
satisfies

    $|\phi(x)| \le \frac{1}{\lambda} \sqrt{\int [K(x, y)]^2\, dP(y)}$.

Proof. By the Cauchy-Schwarz inequality and the definition of
eigenfunction (2.1), we see that

    $\lambda |\phi(x)| = \left| \int K(x, y) \phi(y)\, dP(y) \right| \le \int K(x, y) |\phi(y)|\, dP(y)$
    $\le \sqrt{\int [K(x, y)]^2\, dP(y)} \sqrt{\int [\phi(y)]^2\, dP(y)} = \sqrt{\int [K(x, y)]^2\, dP(y)}$.

The conclusion follows. □

We see that the "tails" of eigenfunctions of $\mathcal{K}_P$ decay to zero
and that the decay rate depends on the tail behaviors of both the kernel $K$
and the distribution $P$. This observation will be useful to separate
high-density areas in the case of $P$ having several components. Actually, we
have the following corollary immediately:

Corollary 1. Let $K(x, y) = k(\|x - y\|)$ with $k(\cdot)$ nonincreasing.
Assume that $P$ is supported on a compact set $D \subset \mathbb{R}^d$. Then

    $|\phi(x)| \le \frac{k(\mathrm{dist}(x, D))}{\lambda}$,

where $\mathrm{dist}(x, D) = \inf_{y \in D} \|x - y\|$.
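
The bound of Theorem 1 also holds exactly for the empirical operator, since the proof only uses the Cauchy-Schwarz inequality. The sketch below checks this numerically on a one-dimensional sample; the sample, bandwidth and evaluation grid are assumptions chosen for illustration. An eigenvector $v$ of $K_n$ with unit Euclidean norm is rescaled by $\sqrt{n}$ so that the extended eigenfunction satisfies the normalization $\int \phi^2\, dP_n = 1$ used in the theorem.

```python
import numpy as np

# Illustrative setup (assumptions): standard normal sample, Gaussian kernel.
rng = np.random.default_rng(2)
n, omega = 300, 0.3
x = rng.normal(size=n)

def kernel(a, b):
    """Gaussian kernel matrix between two 1-D arrays of points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * omega ** 2))

K_n = kernel(x, x) / n
lams, V = np.linalg.eigh(K_n)                 # ascending eigenvalues

grid = np.linspace(-6.0, 6.0, 241)            # evaluation points, incl. tails
K_grid = kernel(grid, x)
# sqrt( integral of K(x, y)^2 dP_n(y) ) at every grid point
rhs_root = np.sqrt((K_grid ** 2).mean(axis=1))

min_slack = np.inf
for j in (1, 2, 3):                           # three leading eigenpairs
    lam, v = lams[-j], V[:, -j]
    # Extension of v, rescaled so that mean(phi(x_i)^2) = 1.
    phi = np.sqrt(n) * (K_grid @ v) / (lam * n)
    slack = rhs_root / lam - np.abs(phi)      # bound minus |phi|, should be >= 0
    min_slack = min(min_slack, slack.min())
```

A nonnegative `min_slack` (up to rounding) confirms the inequality on the whole grid, including points far in the tails where both sides are close to zero.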
The proof follows from Theorem 1 and the fact that $k(\cdot)$ is a
nonincreasing function. We now give an important property of the top
(corresponding to the largest eigenvalue) eigenfunction.

Theorem 2 (Top eigenfunction). Let $K(x, y)$ be a positive semi-definite
kernel with full support on $\mathbb{R}^d$. The top eigenfunction
$\phi_0(x)$ of the convolution operator $\mathcal{K}_P$:

1. is the only eigenfunction with no sign change on $\mathbb{R}^d$;

2. has multiplicity one;

3. is nonzero on the support of $P$.

The proof is given in Appendix B and these properties will be used later
when we propose our clustering algorithm in Section 4.

3.2. An example: top eigenfunctions of $\mathcal{K}_P$ for mixture
distributions. We now study the spectrum of $\mathcal{K}_P$ defined on a
mixture distribution

(3.1)    $P = \sum_{g=1}^{G} \pi^g P^g$,

which is a commonly used model in clustering and classification. To reduce
notation confusion, we use italicized superscript $1, 2, \ldots, g, \ldots, G$
as the index of the mixing component and ordinary superscript for the power
of a number. For each mixing component $P^g$, we define the corresponding
operator $\mathcal{K}_{P^g}$ as

    $\mathcal{K}_{P^g} f(x) = \int K(x, y) f(y)\, dP^g(y)$.

We start with a mixture Gaussian example given in Figure 1. Gaussian
kernel matrices $K_n$, $K_n^1$ and $K_n^2$ ($\omega = 0.3$) are constructed
on three batches of 1000 i.i.d. samples from each of the three distributions:
$0.5N(2, 1^2) + 0.5N(-2, 1^2)$, $N(2, 1^2)$ and $N(-2, 1^2)$. We observe that
the top eigenvectors of $K_n$ are nearly identical to the top eigenvectors of
$K_n^1$ or $K_n^2$.
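
A variant of this experiment can be sketched in a few lines. This is not the authors' code: the random seed is an assumption, and instead of comparing against separately sampled component matrices, the sketch only measures how much of each leading eigenvector of $K_n$ sits on either side of the midpoint between the two component means.

```python
import numpy as np

rng = np.random.default_rng(0)
n, omega = 1000, 0.3

# Sample from 0.5 N(2, 1) + 0.5 N(-2, 1) via random component labels.
comp = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(comp == 0, 2.0, -2.0), scale=1.0)

# Gaussian kernel matrix (K_n)_ij = K(x_i, x_j) / n.
K_n = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * omega ** 2)) / n
lams, V = np.linalg.eigh(K_n)          # ascending eigenvalues
top2 = V[:, [-1, -2]]                  # the two leading eigenvectors

# Fraction of each eigenvector's squared mass on sample points with x > 0.
# Columns have unit norm, so each fraction lies in [0, 1].
mass_right = (top2[x > 0] ** 2).sum(axis=0)
```

If the two leading eigenvectors each follow one mixture component, one entry of `mass_right` is close to 1 and the other close to 0, in line with the observation reported for Figure 1.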

From the point of view of operator theory, it is easy to understand
this phenomenon: with a properly chosen kernel, the top eigenfunctions of an
operator defined on each mixing component are approximate eigenfunctions
of the operator defined on the mixture distribution. To be explicit, let us
consider the Gaussian convolution operator $\mathcal{K}_P$ defined by
$P = \pi^1 P^1 + \pi^2 P^2$, with Gaussian components
$P^1 = N(\mu^1, [\sigma^1]^2)$ and $P^2 = N(\mu^2, [\sigma^2]^2)$
and the Gaussian kernel $K(x, y)$ with bandwidth $\omega$. Due to the
linearity of convolution operators,
$\mathcal{K}_P = \pi^1 \mathcal{K}_{P^1} + \pi^2 \mathcal{K}_{P^2}$.
Fig. 1. Eigenvectors of a Gaussian kernel matrix ($\omega = 0.3$) of 1000 data
sampled from a mixture Gaussian distribution $0.5N(2, 1^2) + 0.5N(-2, 1^2)$.
Left panels: histogram of the data (top), first eigenvector of $K_n$ (middle)
and second eigenvector of $K_n$ (bottom). Right panels: histograms of data
from each component (top), first eigenvector of $K_n^1$ (middle)
and first eigenvector of $K_n^2$ (bottom).



Consider an eigenfunction $\phi^1(x)$ of $\mathcal{K}_{P^1}$ with the
corresponding eigenvalue $\lambda^1$,
$\mathcal{K}_{P^1} \phi^1(x) = \lambda^1 \phi^1(x)$. We have

    $\mathcal{K}_P \phi^1(x) = \pi^1 \lambda^1 \phi^1(x) + \pi^2 \int K(x, y) \phi^1(y)\, dP^2(y)$.

As shown in Proposition 1 in Appendix A, in the Gaussian case, $\phi^1(x)$ is
centered at $\mu^1$ and its tail decays exponentially. Therefore, assuming
enough separation between $\mu^1$ and $\mu^2$,
$\pi^2 \int K(x, y) \phi^1(y)\, dP^2(y)$ is close to 0 everywhere, and hence
$\phi^1(x)$ is an approximate eigenfunction of $\mathcal{K}_P$. In the next
section, we will show that a similar approximation holds for general mixture
distributions whose components may not be Gaussian distributions.

3.3. Perturbation analysis. For $\mathcal{K}_P$ defined by a mixture
distribution (3.1) and a positive semi-definite kernel $K(\cdot,\cdot)$, we
now study the connection between its top eigenvalues and eigenfunctions and
those of each $\mathcal{K}_{P^g}$. Without loss of generality, let us
consider a mixture of two components. We state the following theorem
regarding the top eigenvalue $\lambda_0$ of $\mathcal{K}_P$.
Fig. 2. Illustration of separation condition (3.2) in Theorem 3.



Theorem 3 (Top eigenvalue of mixture distribution). Let
$P = \pi^1 P^1 + \pi^2 P^2$ be a mixture distribution on $\mathbb{R}^d$ with
$\pi^1 + \pi^2 = 1$. Given a positive semi-definite kernel $K$, denote the
top eigenvalue of $\mathcal{K}_P$, $\mathcal{K}_{P^1}$ and
$\mathcal{K}_{P^2}$ as $\lambda_0$, $\lambda_0^1$ and $\lambda_0^2$,
respectively. Then $\lambda_0$ satisfies

    $\max(\pi^1 \lambda_0^1, \pi^2 \lambda_0^2) \le \lambda_0 \le \max(\pi^1 \lambda_0^1, \pi^2 \lambda_0^2) + r$,

where

(3.2)    $r = \left( \pi^1 \pi^2 \iint [K(x, y)]^2\, dP^1(x)\, dP^2(y) \right)^{1/2}$.

The proof is given in Appendix B. As illustrated in Figure 2, the value
of $r$ in (3.2) is small when $P^1$ and $P^2$ do not overlap much. Meanwhile,
the size of $r$ is also affected by how fast $K(x, y)$ approaches zero as
$\|x - y\|$ increases. When $r$ is small, the top eigenvalue of
$\mathcal{K}_P$ is close to the larger one of $\pi^1 \lambda_0^1$ and
$\pi^2 \lambda_0^2$. Without loss of generality, we assume
$\pi^1 \lambda_0^1 \ge \pi^2 \lambda_0^2$ in the rest of this section.

The next lemma is a general perturbation result for the eigenfunctions
of $\mathcal{K}_P$. The empirical (matrix) version of this lemma appeared in
Diaconis, Goel and Holmes [3] and more general results can be traced back to
Parlett [9].

Lemma 1. Consider an operator $\mathcal{K}_P$ with the discrete spectrum
$\lambda_0 \ge \lambda_1 \ge \cdots$. If

    $\|\mathcal{K}_P f - \lambda f\|_{L^2_P} \le \varepsilon$

for some $\lambda, \varepsilon > 0$, and $f \in L^2_P$, then $\mathcal{K}_P$
has an eigenvalue $\lambda_k$ such that $|\lambda_k - \lambda| \le \varepsilon$.
If we further assume that
$s = \min_{i: \lambda_i \ne \lambda_k} |\lambda_i - \lambda_k| > \varepsilon$,
then $\mathcal{K}_P$ has an eigenfunction $f_k$ corresponding to $\lambda_k$
such that $\|f - f_k\|_{L^2_P} \le \frac{\varepsilon}{s - \varepsilon}$.
The lemma shows that a constant $\lambda$ must be "close" to an eigenvalue
of $\mathcal{K}_P$ if the operator "almost" projects a function $f$ to
$\lambda f$. Moreover, the function $f$ must be "close" to an eigenfunction
of $\mathcal{K}_P$ if the distance between $\mathcal{K}_P f$ and $\lambda f$
is smaller than the eigen-gaps between $\lambda_k$ and other eigenvalues
of $\mathcal{K}_P$. We are now in a position to state the perturbation result
for the top eigenfunction of $\mathcal{K}_P$. Given the facts that
$|\lambda_0 - \pi^1 \lambda_0^1| \le r$ and

    $\mathcal{K}_P \phi_0^1 = \pi^1 \mathcal{K}_{P^1} \phi_0^1 + \pi^2 \mathcal{K}_{P^2} \phi_0^1 = (\pi^1 \lambda_0^1) \phi_0^1 + \pi^2 \mathcal{K}_{P^2} \phi_0^1$,

Lemma 1 indicates that $\phi_0^1$ is close to $\phi_0$ if
$\|\pi^2 \mathcal{K}_{P^2} \phi_0^1\|_{L^2_P}$ is small enough.
To be explicit, we formulate the following corollary.



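Corollary 2 (Top eigenfunction of mixture distribution). Let
$P = \pi^1 P^1 + \pi^2 P^2$ be a mixture distribution on $\mathbb{R}^d$ with
$\pi^1 + \pi^2 = 1$. Given a positive semi-definite kernel $K(\cdot,\cdot)$,
we denote the top eigenvalues of $\mathcal{K}_{P^1}$ and $\mathcal{K}_{P^2}$
as $\lambda_0^1$ and $\lambda_0^2$, respectively (assuming
$\pi^1 \lambda_0^1 \ge \pi^2 \lambda_0^2$) and define
$t = \lambda_0 - \lambda_1$, the eigen-gap of $\mathcal{K}_P$. If the
constant $r$ defined in (3.2) satisfies $r < t$, and

(3.3)    $\left\| \pi^2 \int K(x, y) \phi_0^1(y)\, dP^2(y) \right\|_{L^2_P} \le \varepsilon$,

such that $\varepsilon + r < t$, then $\pi^1 \lambda_0^1$ is close to
$\mathcal{K}_P$'s top eigenvalue $\lambda_0$,

    $|\pi^1 \lambda_0^1 - \lambda_0| \le \varepsilon$,

and $\phi_0^1$ is close to $\mathcal{K}_P$'s top eigenfunction $\phi_0$ in
the $L^2_P$ sense,

(3.4)    $\|\phi_0^1 - \phi_0\|_{L^2_P} \le \frac{\varepsilon}{t - \varepsilon}$.

The proof is trivial, so it is omitted here. Since Theorem 3 leads to
$|\pi^1 \lambda_0^1 - \lambda_0| \le r$ and Lemma 1 suggests
$|\pi^1 \lambda_0^1 - \lambda_k| \le \varepsilon$ for some $k$, the condition
$r + \varepsilon < t = \lambda_0 - \lambda_1$ guarantees that $\phi_0$ is the
only possible choice for $\phi_0^1$ to be close to. Therefore, $\phi_0^1$ is
approximately the top eigenfunction of $\mathcal{K}_P$.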

It is worth noting that the separation conditions in Theorem 3 and Corollary
2 are mainly based on the overlap of the mixture components, but not on
their shapes or parametric forms. Therefore, clustering methods based on
spectral information are able to deal with more general problems beyond the
traditional mixture models based on a parametric family, such as mixtures
of Gaussians or mixtures of exponential families.

3.4. Top spectrum of $\mathcal{K}_P$ for mixture distributions. For a mixture
distribution with enough separation between its mixing components, we now
extend the perturbation results in Corollary 2 to other top eigenfunctions
of $\mathcal{K}_P$. With close agreement between $(\lambda_0, \phi_0)$ and
$(\pi^1 \lambda_0^1, \phi_0^1)$, we observe
that the second top eigenvalue of $\mathcal{K}_P$ is approximately
$\max(\pi^1 \lambda_1^1, \pi^2 \lambda_0^2)$
by investigating the top eigenvalue of the operator defined by a new kernel
$K_{\mathrm{new}} = K(x, y) - \lambda_0 \phi_0(x) \phi_0(y)$ and $P$.
Accordingly, one may also derive the conditions under which the second
eigenfunction of $\mathcal{K}_P$ is approximated by $\phi_1^1$ or
$\phi_0^2$, depending on the magnitude of $\pi^1 \lambda_1^1$ and
$\pi^2 \lambda_0^2$. By sequentially applying the same argument, we arrive
at the following property.

Property 1 (Mixture property of top spectrum). For a convolution
operator $\mathcal{K}_P$, defined by a positive semi-definite kernel with a
fast tail decay and a mixture distribution $P = \sum_{g=1}^{G} \pi^g P^g$
with enough separations between its mixing components, the top
eigenfunctions of $\mathcal{K}_P$ are approximately chosen from the top ones
($\phi_i^g$) of $\mathcal{K}_{P^g}$, $i = 0, 1, \ldots, n$,
$g = 1, \ldots, G$. The ordering of the eigenfunctions is determined by
mixture magnitudes $\pi^g \lambda_i^g$.

This property suggests that each of the top eigenfunctions of
$\mathcal{K}_P$ corresponds to exactly one of the separable mixture
components. Therefore, we can approximate the top eigenfunctions of
$\mathcal{K}_{P^g}$ through those of $\mathcal{K}_P$ when
enough separations exist among mixing components. However, several of
the top eigenfunctions of $\mathcal{K}_P$ can correspond to the same
component and a fixed number of top eigenfunctions may miss some components
entirely, specifically the ones with small mixing weights $\pi^g$ or small
eigenvalue $\lambda^g$.

When there is a large i.i.d. sample from a mixture distribution whose
components are well separated, we expect the top eigenvalues and
eigenfunctions of $\mathcal{K}_P$ to be close to those of the empirical
operator $\mathcal{K}_{P_n}$. As discussed in Section 2.2, the eigenvalues of
$\mathcal{K}_{P_n}$ are the same as those of the kernel matrix $K_n$ and the
eigenfunctions of $\mathcal{K}_{P_n}$ coincide with the eigenvectors of $K_n$
on the sampled points. Therefore, assuming good approximation of
$\mathcal{K}_{P_n}$ to $\mathcal{K}_P$, the eigenvalues and eigenvectors of
$K_n$ provide us with access to the spectrum of $\mathcal{K}_P$.

This understanding sheds light on the algorithms proposed in Scott and
Longuet-Higgins [13] and Perona and Freeman [10], in which the top (several)
eigenvectors of $K_n$ are used for clustering. While the top eigenvectors
may contain clustering information, smaller or less compact groups may not 
be identified using only the very top part of the spectrum. More eigenvectors 
need to be investigated to see these clusters. On the other hand, information 
in the top few eigenvectors may also be redundant for clustering, as some of 
these eigenvectors may represent the same group. 

3.5. A real-data example: a USPS digits dataset. Here we use a
high-dimensional U.S. Postal Service (USPS) digit dataset to illustrate the
properties of the top spectrum of $\mathcal{K}_P$. The data set contains
normalized handwritten digits, automatically scanned from envelopes by the
USPS. The images here have been rescaled and size-normalized, resulting in
16 x 16 grayscale images (see Le Cun et al. [5] for details). Each image is
treated as a vector $x_i$ in $\mathbb{R}^{256}$. In this experiment, 658
"3"s, 652 "4"s and 556 "5"s in the training
data are pooled together as our sample (size 1866).

Taking the Gaussian kernel with bandwidth $\omega = 2$, we construct the
kernel matrix $K_n$ and compute its eigenvectors
$v_1, v_2, \ldots, v_{1866}$. We visualize the
digits corresponding to large absolute values of the top eigenvectors. Given
an eigenvector $v_j$, we rank the digits $x_i$, $i = 1, 2, \ldots, 1866$,
according to the absolute value $|(v_j)_i|$. In each row of Figure 3, we show
the 1st, 36th, 71st, ..., 316th digits according to that order for a fixed
eigenvector $v_j$, $j = 1, 2, 3, 15, 16, 17, 48, 49, 50$. It turns out that
the digits with large absolute values of the top 15 eigenvectors, some shown
in Figure 3, all represent number "4." The 16th eigenvector is the first one
representing "3" and the 49th eigenvector is the first one for "5."

The plot of the data embedded using the top three eigenvectors shown
in the left panel of Figure 4 suggests no separation of digits. These results
are strongly consistent with our theoretical findings: A fixed number of the
top eigenvectors of $K_n$ may correspond to the same cluster while missing
other significant clusters. This leads to the failure of clustering algorithms
only using the top eigenvectors of $K_n$. The $k$-means algorithm based on top
eigenvectors (normalized as suggested in Scott and Longuet-Higgins [13])
produces accuracies below 80% and reaches the best performance as the
49th eigenvector is included.

Fig. 3. Digits ranked by the absolute value of eigenvectors
$v_1, v_2, \ldots, v_{50}$. The digits in each row correspond to the 1st,
36th, 71st, ..., 316th largest absolute value of the selected eigenvector.
Three eigenvectors, $v_1$, $v_{16}$ and $v_{49}$, are identified by our
DaSpec algorithm.

Fig. 4. Left: scatter plots of digits embedded in the top three eigenvectors;
right: digits embedded in the 1st, 16th and 49th eigenvectors.

Meanwhile, the data embedded in the 1st, 16th and 49th eigenvectors (the 
right panel of Figure 4) do present the three groups of digits "3," "4" and 
"5" nearly perfectly. If one can intelligently identify these eigenvectors and 
cluster data in the space spanned by them, good performance is expected. In 
the next section, we utilize our theoretical analysis to construct a clustering 
algorithm that automatically selects these most informative eigenvectors and 
groups the data accordingly. 

4. A data spectroscopic clustering (DaSpec) algorithm. In this section, 
we propose a data spectroscopic clustering (DaSpec) algorithm based on 
our theoretical analyses. We chose the commonly used Gaussian kernel, but 
it may be replaced by other positive definite radial kernels with a fast tail 
decay rate. 



4.1. Justification and the DaSpec algorithm. As shown in Property 1 for
mixture distributions in Section 3.4, we have access to approximate
eigenfunctions of $\mathcal{K}_{P^g}$ through those of $\mathcal{K}_P$ when
each mixing component has enough separation from the others. We know from
Theorem 2 that among the eigenfunctions of each component
$\mathcal{K}_{P^g}$, the top one is the only eigenfunction with no sign
change. When the spectrum of $\mathcal{K}_{P^g}$ is close to that of
$\mathcal{K}_P$, we expect that there is exactly one eigenfunction with no
sign change over a certain small threshold $\varepsilon$. Therefore, the
number of separable components of $P$ is indicated by the number of
eigenfunctions $\phi(x)$ of $\mathcal{K}_P$ with no sign
change after thresholding.

Meanwhile, the eigenfunctions of each component decay quickly to zero
at the tail of its distribution if there is a good separation of components. At
a given location $x$, in the high-density area of a particular component which
is at the tails of other components, we expect the eigenfunctions from all
other components to be close to zero. Among the top eigenfunctions
$|\phi_0^g(x)|$ of $\mathcal{K}_{P^g}$ defined on each component $P^g$,
$g = 1, \ldots, G$, the group identity of $x$ corresponds to the
eigenfunction that has the largest absolute value, $|\phi_0^g(x)|$.
Combining this observation with previous discussions on the approximation
of $K_n$ to $\mathcal{K}_P$, we propose the following clustering algorithm.

Data spectroscopic clustering (DaSpec) Algorithm.

Input: Data $x_1, \ldots, x_n \in \mathbb{R}^d$.
Parameters: Gaussian kernel bandwidth $\omega > 0$, thresholds
$\varepsilon_j > 0$.

Output: Estimated number of separable components $\hat{G}$ and a cluster
label $\hat{L}(x_i)$ for each data point $x_i$, $i = 1, \ldots, n$.

Step 1. Construct the Gaussian kernel matrix $K_n$:

    $(K_n)_{ij} = \frac{1}{n} e^{-\|x_i - x_j\|^2/(2\omega^2)}, \quad i, j = 1, \ldots, n$,

and compute its eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and
eigenvectors $v_1, v_2, \ldots, v_n$.

Step 2. Estimate the number of clusters:

- Identify all eigenvectors $v_j$ that have no sign changes up to precision
$\varepsilon_j$. [We say that a vector $e = (e_1, \ldots, e_n)'$ has no sign
changes up to $\varepsilon$ if either $\forall i\ e_i > -\varepsilon$ or
$\forall i\ e_i < \varepsilon$.]

- Estimate the number of groups by $\hat{G}$, the number of such eigenvectors.

- Denote these eigenvectors and the corresponding eigenvalues by
$v_0^1, v_0^2, \ldots, v_0^{\hat{G}}$ and
$\lambda_0^1, \lambda_0^2, \ldots, \lambda_0^{\hat{G}}$, respectively.

Step 3. Assign a cluster label to each data point $x_i$ as

    $\hat{L}(x_i) = \arg\max \{ |v_{0i}^g| : g = 1, 2, \ldots, \hat{G} \}$.
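
The three steps above can be sketched compactly in NumPy. This is an illustrative reading of the algorithm, not the authors' implementation: the function name `daspec`, the toy data and the optional `max_rank` argument (which limits the scan to the leading eigenvectors) are assumptions introduced here. The threshold is set to the largest absolute entry of each eigenvector divided by $n$, which is one possible choice of $\varepsilon_j$.

```python
import numpy as np

def daspec(X, omega, max_rank=None):
    """Sketch of DaSpec: returns (G_hat, labels, eigenvalues, eigenvectors)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 1: Gaussian kernel matrix and its eigen decomposition.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / (2.0 * omega ** 2)) / n
    lams, V = np.linalg.eigh(K)
    lams, V = lams[::-1], V[:, ::-1]           # sort by decreasing eigenvalue

    # Step 2: keep eigenvectors with no sign change up to eps_j.
    # max_rank is a hypothetical cap on how many eigenvectors are scanned.
    limit = n if max_rank is None else min(max_rank, n)
    selected = []
    for j in range(limit):
        v = V[:, j]
        eps = np.abs(v).max() / n              # assumed threshold choice
        if np.all(v > -eps) or np.all(v < eps):
            selected.append(j)

    # Step 3: label each point by the selected eigenvector that is
    # largest in absolute value at that point.
    labels = np.argmax(np.abs(V[:, selected]), axis=1)
    return len(selected), labels, lams[selected], V[:, selected]

# Toy data (an assumption): two separated groups of unequal size.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-4.0, 0.0], 0.5, size=(120, 2)),
               rng.normal([4.0, 0.0], 0.5, size=(80, 2))])
G_hat, labels, lam_sel, V_sel = daspec(X, omega=1.0, max_rank=20)
```

On this toy sample the sketch is expected to select one eigenvector per group, so that the number of groups is estimated without being supplied in advance. Capping the scan with `max_rank` is only a convenience for the example; the algorithm as stated examines all eigenvectors.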



It is obviously important to have data-dependent choices for the parameters
of the DaSpec algorithm: $\omega$ and the $\varepsilon_j$'s. We will discuss
some heuristics for those choices in the next section. Given a DaSpec
clustering result, one important feature of our algorithm is that little
adjustment is needed to classify a new data point $x$. Thanks to the
connection between the eigenvector $v$ of $K_n$ and the eigenfunction $\phi$
of the empirical operator $\mathcal{K}_{P_n}$, we
can compute the eigenfunction $\phi_0^g$ corresponding to $v_0^g$ by

    $\phi_0^g(x) = \frac{1}{\lambda_0^g n} \sum_{i=1}^{n} K(x, x_i) v_{0i}^g, \quad x \in \mathbb{R}^d$.

Therefore, Step 3 of the algorithm can be readily applied to any $x$ by
replacing $v_{0i}^g$ with $\phi_0^g(x)$. So the algorithm output can serve as
a clustering rule that separates not only the data, but also the underlying
distribution, which is aligned with the motivation behind our Data
Spectroscopy algorithm: learning properties of a distribution through the
empirical spectrum of $\mathcal{K}_{P_n}$.

4.2. Data- dependent parameter specification. Following the justification 
of our DaSpec algorithm, we provide some heuristics on choosing algorithm 
parameters in a data-dependent way. 

Gaussian kernel bandwidth $\omega$. The bandwidth controls both the eigengaps
and the tail decay rates of the eigenfunctions. When $\omega$ is too large, the
tails of eigenfunctions may not decay fast enough to make condition (3.3)
in Corollary 2 hold. However, if $\omega$ is too small, the eigengaps may vanish,
in which case each data point will end up as a separate group. Intuitively,
we want to select a small $\omega$ but still keep enough (say, $n \times 5\%$)
neighbors for most (95% of) data points in the "range" of the kernel, which we
define as a length $l$ that makes $P(\|X\| < l) = 95\%$. In the case of a
Gaussian kernel in $\mathbb{R}^d$, $l = \omega\sqrt{95\%\ \text{quantile of}\ \chi_d^2}$.

Given data $x_1, \ldots, x_n$ or their pairwise $L_2$ distances $d(x_i, x_j)$,
we can find an $\omega$ that satisfies the above criteria by first calculating
$q_i$ = 5% quantile of $\{d(x_i, x_j), j = 1, \ldots, n\}$ for each
$i = 1, \ldots, n$, then taking

$$(4.1)\qquad \omega = \frac{95\%\ \text{quantile of}\ \{q_1, \ldots, q_n\}}{\sqrt{95\%\ \text{quantile of}\ \chi_d^2}}.$$
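The bandwidth rule (4.1) is straightforward to compute. The sketch below uses
only the Python standard library and makes two assumptions that are not part
of the text: empirical quantiles are taken by linear interpolation between
order statistics, and the $\chi_d^2$ quantile is replaced by the
Wilson-Hilferty approximation (an exact quantile, for instance from
`scipy.stats.chi2.ppf`, could be substituted). All function names are
hypothetical.

```python
import math
from statistics import NormalDist

def quantile(values, p):
    """Empirical p-quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def chi2_quantile_approx(p, d):
    """Wilson-Hilferty approximation to the p-quantile of chi-square with d d.o.f."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * d)
    return d * (1.0 - c + z * math.sqrt(c)) ** 3

def bandwidth(data):
    """omega = (95% quantile of {q_i}) / sqrt(95% quantile of chi^2_d), as in (4.1)."""
    d = len(data[0])
    q = []
    for xi in data:
        # Distances from x_i to all x_j, j = 1..n (this includes d(x_i, x_i) = 0).
        dists = [math.dist(xi, xj) for xj in data]
        q.append(quantile(dists, 0.05))
    return quantile(q, 0.95) / math.sqrt(chi2_quantile_approx(0.95, d))

# Usage on a small regular grid in the plane.
grid = [(float(i), float(j)) for i in range(5) for j in range(5)]
omega = bandwidth(grid)
```

Since every quantity in (4.1) is a distance quantile, rescaling the data by a
constant rescales the resulting bandwidth by the same constant.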



As shown in the simulation studies in Section 5, this particular choice of
$\omega$ works well in the low-dimensional case. For high-dimensional data
generated from a lower-dimensional structure, such as an $m$-manifold, the
procedure usually leads to an $\omega$ that is too small. We suggest starting
with the $\omega$ defined in (4.1) and trying some neighboring values to see if
the results are improved, perhaps based on some labeled data, expert opinions,
data visualization or the trade-off between the between-cluster and
within-cluster distances.

Threshold $\epsilon_j$. When identifying the eigenvectors with no sign changes
in Step 2, a threshold $\epsilon_j$ is included to deal with the small
perturbation introduced by other well-separable mixture components. Since
$\|v_j\|_2 = 1$ and the elements of the eigenvector decrease quickly
(exponentially) from $\max_i(|v_j(x_i)|)$, we suggest thresholding $v_j$ at
$\epsilon_j = \max_i(|v_j(x_i)|)/n$ ($n$ being the sample size) to accommodate
the perturbation.

We note that the proper selection of algorithm parameters is critical
to the separation of the spectrum and the success of the clustering
algorithms hinged on the separation. Although the described heuristics seem to
work well for low-dimensional datasets (as we will show in the next section),
they are still preliminary and more research is needed, especially in
high-dimensional data analysis. We plan to further study the data-adaptive
parameter selection procedure in the future.

Fig. 5. Clustering results on four simulated data sets described in Section 5.1. First
column: scatter plots of data; second column: results of the proposed spectroscopic
clustering algorithm; third column: results of the k-means algorithm; fourth column:
results of the spectral clustering algorithm (Ng, Jordan and Weiss [8]).

5. Simulation studies. 

5.1. Gaussian type components. In this simulation, we examine the
effectiveness of the proposed DaSpec algorithm on datasets generated from
Gaussian mixtures. Each data set (size of 400) is sampled from a mixture of
six bivariate Gaussians, while the size of each group follows a Multinomial
distribution ($n = 400$, and $p_1 = \cdots = p_6 = 1/6$). The mean and standard
deviation of each Gaussian are randomly drawn from a Uniform on $(-5, 5)$
and a Uniform on $(0, 0.8)$, respectively. Four data sets generated from this
distribution are plotted in the left column of Figure 5. It is clear that the
groups may be highly unbalanced and overlap with each other. Therefore,
rather than trying to separate all six components, we expect good clustering
algorithms to identify groups with reasonable separations between high
density areas.
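A data set of this kind can be generated with the standard library alone. The
sketch below is one possible reading of the description and rests on
assumptions: each coordinate of a component mean is drawn independently from
Uniform$(-5, 5)$, each component is isotropic with a single standard deviation
drawn from Uniform$(0, 0.8)$, and the seed and function name are hypothetical.

```python
import random

def sample_gaussian_mixture(n=400, k=6, seed=0):
    """Draw n points from a mixture of k isotropic bivariate Gaussians."""
    rng = random.Random(seed)
    # Assumed: mean coordinates i.i.d. Uniform(-5, 5); one sd per component.
    means = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(k)]
    sds = [rng.uniform(0, 0.8) for _ in range(k)]
    points, groups = [], []
    for _ in range(n):
        # Choosing a component uniformly for each point gives group sizes that
        # follow a Multinomial(n; 1/k, ..., 1/k) distribution.
        g = rng.randrange(k)
        mx, my = means[g]
        points.append((rng.gauss(mx, sds[g]), rng.gauss(my, sds[g])))
        groups.append(g)
    return points, groups

points, groups = sample_gaussian_mixture()
```

With randomly placed means and random spreads, some components will typically
overlap, which is why the number of separable groups can be smaller than six.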

The DaSpec algorithm is applied with parameters $\omega$ and $\epsilon_j$
chosen by the procedure described in Section 4.2. Taking the number of groups
identified by our DaSpec algorithm, the commonly used k-means algorithm and
the spectral clustering algorithm proposed in Ng, Jordan and Weiss [8]
(using the same $\omega$ as DaSpec) are also tested to serve as baselines for
comparison. As is common practice with the k-means algorithm, fifty random
initializations are used and the final results are from the one that minimizes
the optimization criterion $\sum_{i=1}^n (x_i - y_{k(i)})^2$, where $x_i$ is
assigned to group $k(i)$ and
$y_k = \sum_{i=1}^n x_i I(k(i) = k) / \sum_{i=1}^n I(k(i) = k)$.

As shown in the second column of Figure 5, the proposed DaSpec algorithm
(with data-dependent parameter choices) identifies the number of
separable groups, isolates potential outliers and groups data accordingly.
The results are similar to the k-means algorithm results (the third column)
when the groups are balanced and their shapes are close to round. In these
cases, the k-means algorithm is expected to work well, given that the data
in each group are well represented by their average. The last column shows the
results of Ng et al.'s spectral clustering algorithm, which sometimes (see the
first row) assigns data to one group even when they are actually far away.

In summary, for this simulated example, we find that the proposed DaSpec 
algorithm, with data-adaptively chosen parameters, identifies the number 
of separable groups reasonably well and produces good clustering results 
when the separations are large enough. It is also interesting to note that 
the algorithm isolates possible "outliers" into a separate group so that they 
do not affect the clustering results on the majority of data. The proposed 
algorithm competes well against the commonly used k-means and spectral
clustering algorithms. 

5.2. Beyond Gaussian components. We now compare the performance
of the aforementioned clustering algorithms on data sets that contain
non-Gaussian groups, various levels of noise and possible outliers. Data set
$\mathcal{D}_1$ contains three well-separable groups and an outlier in
$\mathbb{R}^2$. The first group of data is generated by adding independent
Gaussian noise $N((0, 0)^T, 0.15^2 I_{2 \times 2})$ to 200 uniform samples from
three fourths of a ring with radius 3, which is from the same distribution as
those plotted in the right panel of Figure 8. The second group includes 100
data points sampled from a bivariate Gaussian $N((3, -3)^T, 0.5^2 I_{2 \times 2})$,
and the last group has only 5 data points sampled from a bivariate Gaussian
$N((0, 0)^T, 0.3^2 I_{2 \times 2})$. Finally, one outlier is located at $(5, 5)$.
Given $\mathcal{D}_1$, three more data sets ($\mathcal{D}_2$, $\mathcal{D}_3$
and $\mathcal{D}_4$) are created by gradually adding independent Gaussian noise
(with standard deviations 0.3, 0.6, 0.9, respectively). The scatter plots of
the four datasets are shown in the left column of Figure 6. It is clear that
the degree of separation decreases from top to bottom.
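For readers who wish to reproduce a data set of this shape, a possible
construction is sketched below. Several details are not specified in the text
and are assumed here: the ring segment is taken to span angles in
$[0, 3\pi/2]$, each noisier data set is obtained by adding noise of the stated
standard deviation to every point of the first data set (including the
outlier), and the seeds and function names are hypothetical.

```python
import math
import random

def make_d1(seed=0):
    """Ring segment (200 pts), two Gaussian groups (100 and 5 pts), one outlier."""
    rng = random.Random(seed)
    pts = []
    # Group 1: uniform samples on 3/4 of a ring of radius 3 plus N(0, 0.15^2 I) noise.
    # Assumed: the segment covers angles in [0, 3*pi/2].
    for _ in range(200):
        t = rng.uniform(0.0, 1.5 * math.pi)
        pts.append((3 * math.cos(t) + rng.gauss(0, 0.15),
                    3 * math.sin(t) + rng.gauss(0, 0.15)))
    # Group 2: 100 points from N((3, -3), 0.5^2 I).
    for _ in range(100):
        pts.append((rng.gauss(3, 0.5), rng.gauss(-3, 0.5)))
    # Group 3: 5 points from N((0, 0), 0.3^2 I).
    for _ in range(5):
        pts.append((rng.gauss(0, 0.3), rng.gauss(0, 0.3)))
    # Outlier at (5, 5).
    pts.append((5.0, 5.0))
    return pts

def add_noise(points, sd, seed=1):
    """Add independent N(0, sd^2) noise to both coordinates of every point."""
    rng = random.Random(seed)
    return [(x + rng.gauss(0, sd), y + rng.gauss(0, sd)) for x, y in points]

d1 = make_d1()
d2, d3, d4 = (add_noise(d1, sd) for sd in (0.3, 0.6, 0.9))
```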

Similarly to the previous simulation, we examine the DaSpec algorithm
with data-driven parameters, the k-means and Ng et al.'s spectral clustering
algorithms on these data sets. The latter two algorithms are tested under two
different assumptions on the number of groups: the number ($G$) identified
by the DaSpec algorithm or one group less ($G - 1$). Note that the DaSpec
algorithm claims only one group for $\mathcal{D}_4$, so the other two
algorithms are not applied to $\mathcal{D}_4$.

Fig. 6. Clustering results on four simulated data sets described in Section 5.2. First
column: scatter plots of data; second column: labels of the $G$ identified groups by the
proposed spectroscopic clustering algorithm; third and fourth columns: k-means algorithm
assuming $G - 1$ and $G$ groups, respectively; fifth and sixth columns: spectral clustering
algorithm (Ng, Jordan and Weiss [8]) assuming $G - 1$ and $G$ groups, respectively.

The DaSpec algorithm (the second column in the right panel of Figure
6) produces a reasonable number of groups and clustering results. For the
perfectly separable case in $\mathcal{D}_1$, three groups are identified and
the one outlier is isolated. It is worth noting that the incomplete ring is
separated from other groups, which is not a simple task for algorithms based
on group centroids. We also see that the DaSpec algorithm starts to combine
inseparable groups as the components become less separable.

Not surprisingly, the k-means algorithms (the third and fourth columns)
do not perform well because of the presence of the non-Gaussian component, 
unbalanced groups and outliers. Given enough separations, the spectral clus- 
tering algorithm reports reasonable results (the fifth and sixth columns). 
However, it is sensitive to outliers and the specification of the number of 
groups.

6. Conclusions and discussion. Motivated by recent developments in kernel
and spectral methods, we study the connection between a probability
distribution and the associated convolution operator. For a convolution
operator defined by a radial kernel with a fast tail decay, we show that each
top eigenfunction of the convolution operator defined by a mixture
distribution is approximated by one of the top eigenfunctions of the operator
corresponding to a mixture component. The separation condition is mainly
based on the overlap between high-density components, instead of their
explicit parametric forms, and thus is quite general. These theoretical results
explain why the top eigenvectors of the kernel matrix may reveal the clustering
information but do not always do so. More importantly, our results reveal
that not every component will contribute to the top few eigenfunctions of
the convolution operator $\mathcal{K}_P$ because the size and configuration of
a component decide the corresponding eigenvalues. Hence the top eigenvectors of
the kernel matrix may or may not preserve all clustering information, which
explains some empirical observations of certain spectral clustering methods.

Following our theoretical analyses, we propose the data spectroscopic
clustering algorithm based on finding eigenvectors with no sign change.
Compared with the commonly used k-means and spectral clustering algorithms,
DaSpec is simple to implement and provides a natural estimator of the number
of separable components. We found that DaSpec handles unbalanced groups and
outliers better than the competing algorithms. Importantly, unlike k-means
and certain spectral clustering algorithms, DaSpec does not require random
initialization, which is a potentially significant advantage in practice.
Simulations show favorable results compared to k-means and spectral clustering
algorithms. For practical applications, we also provide some guidelines for
choosing the algorithm parameters.

Our analyses and discussions on connections to other spectral or kernel 
methods shed light on why radial kernels, such as Gaussian kernels, perform 
well in many classification and clustering algorithms. We expect that this 
line of investigation would also prove fruitful in understanding other kernel 
algorithms, such as Support Vector Machines. 

APPENDIX A 

Here we provide three concrete examples to illustrate the properties of
the eigenfunction of $\mathcal{K}_P$ shown in Section 3.1.

Example 1 (Gaussian kernel, Gaussian density). Let us start with the
univariate Gaussian case where the distribution $P \sim N(\mu, \sigma^2)$ and
the kernel function is also Gaussian. Shi, Belkin and Yu [15] provided the
eigenvalues and eigenfunctions of $\mathcal{K}_P$, and the result is a slightly
refined version of a result in Zhu et al. [22].

Proposition 1. For $P \sim N(\mu, \sigma^2)$ and a Gaussian kernel
$K(x, y) = e^{-(x-y)^2/(2\omega^2)}$, let $\beta = 2\sigma^2/\omega^2$ and let
$H_i(x)$ be the $i$th order Hermite polynomial. Then eigenvalues and
eigenfunctions of $\mathcal{K}_P$ for $i = 0, 1, \ldots$ are given by

$$\lambda_i = \frac{\sqrt{2}}{(1 + \beta + \sqrt{1 + 2\beta})^{1/2}} \left(\frac{\beta}{1 + \beta + \sqrt{1 + 2\beta}}\right)^i,$$

$$\phi_i(x) = \frac{(1 + 2\beta)^{1/8}}{\sqrt{2^i i!}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\, \frac{\sqrt{1 + 2\beta} - 1}{2}\right) H_i\left(\left(\frac{1}{4} + \frac{\beta}{2}\right)^{1/4} \frac{x - \mu}{\sigma}\right).$$

Here $H_k$ is the $k$th order Hermite polynomial. Clearly from the explicit
expression and expected from Theorem 2, $\phi_0$ is the only positive
eigenfunction of $\mathcal{K}_P$. We note that each eigenfunction $\phi_i$
decays quickly (as it is a Gaussian multiplied by a polynomial) away from the
mean $\mu$ of the probability distribution. We also see that the eigenvalues of
$\mathcal{K}_P$ decay exponentially with the rate dependent on the bandwidth of
the Gaussian kernel $\omega$ and the variance of the probability distribution
$\sigma^2$. These observations can be easily generalized to the multivariate
case; see Shi, Belkin and Yu [15].
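The exponential decay of the eigenvalues is easy to inspect numerically. The
short sketch below evaluates the eigenvalues
$\lambda_i = \sqrt{2/(1 + \beta + \sqrt{1 + 2\beta})}\,(\beta/(1 + \beta + \sqrt{1 + 2\beta}))^i$
with $\beta = 2\sigma^2/\omega^2$ for given $\sigma$ and $\omega$. The function
name and the truncation level `m` are hypothetical. As a sanity check, the
eigenvalues should sum to $\int K(x, x)\, dP(x) = 1$ for the Gaussian kernel.

```python
import math

def gaussian_operator_eigenvalues(sigma, omega, m=50):
    """First m eigenvalues of K_P for P = N(mu, sigma^2) and bandwidth omega."""
    beta = 2.0 * sigma ** 2 / omega ** 2
    denom = 1.0 + beta + math.sqrt(1.0 + 2.0 * beta)
    lead = math.sqrt(2.0 / denom)   # lambda_0
    ratio = beta / denom            # geometric decay rate, always below 1
    return [lead * ratio ** i for i in range(m)]

# A narrow kernel relative to the spread of P (large beta) gives slow decay;
# a wide kernel (small beta) concentrates the spectrum on lambda_0.
slow = gaussian_operator_eigenvalues(sigma=1.0, omega=0.5, m=200)
fast = gaussian_operator_eigenvalues(sigma=1.0, omega=5.0, m=200)
```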

Example 2 (Exponential kernel, uniform distribution on an interval).
To give another concrete example, consider the exponential kernel
$K(x, y) = \exp(-|x - y|/\omega)$ for the uniform distribution on the interval
$[-1, 1] \subset \mathbb{R}$. In Diaconis, Goel and Holmes [3] it was shown
that the eigenfunctions of this kernel can be written as $\cos(bx)$ or
$\sin(bx)$ inside the interval $[-1, 1]$ for appropriately chosen values of $b$
and decay exponentially away from it. The top eigenfunction can be written
explicitly as follows:

$$\phi(x) = \frac{1}{\lambda} \int_{[-1, 1]} e^{-|x - y|/\omega} \cos(by)\, dy \qquad \forall x \in \mathbb{R},$$

where $\lambda$ is the corresponding eigenvalue. Figure 7 illustrates an
example of this behavior, for $\omega = 0.5$.

Fig. 7. Top two eigenfunctions of the exponential kernel with bandwidth $\omega = 0.5$ and the
uniform distribution on $[-1, 1]$.

Example 3 (A curve in $\mathbb{R}^d$). We now give a brief informal discussion
of the important case when our probability distribution is concentrated on or
around a low-dimensional submanifold in a (potentially high dimensional)
ambient space. The simplest example of this setting is a Gaussian
distribution, which can be viewed as a zero-dimensional manifold (the mean of
the distribution) plus noise.

A more interesting example of a manifold is a curve in $\mathbb{R}^d$. We
observe that such data is generated by any time-dependent smooth deterministic
process, whose parameters depend continuously on time $t$. Let
$\psi(t) : [0, 1] \to \mathbb{R}^d$ be such a curve. Consider a restriction of
the kernel $\mathcal{K}_P$ to $\psi$. Let $x, y \in \psi$ and let $d(x, y)$ be
the geodesic distance along the curve. It can be shown that
$d(x, y) = \|x - y\| + O(\|x - y\|^3)$ when $x, y$ are close, with the remainder
term depending on how the curve is embedded in $\mathbb{R}^d$. Therefore, we
see that if the kernel $\mathcal{K}_P$ is a sufficiently local radial basis
kernel, the restriction of $\mathcal{K}_P$ to $\psi$ is a perturbation of
$\mathcal{K}_P$ in a one-dimensional case. For the exponential kernel, the
one-dimensional kernel can be written explicitly (see Example 2), and we have
an approximation to the kernel on the manifold with a decay off the manifold
(assuming that the kernel is a decreasing function of the distance). For the
Gaussian kernel, a similar extension holds, although no explicit formula can
be easily obtained.

The behaviors of the top eigenfunction of the Gaussian and exponential
kernel, respectively, are demonstrated in Figure 8. The exponential kernel
corresponds to the bottom left panel. The behavior of the eigenfunction is
generally consistent with the top eigenfunction of the exponential kernel
on $[-1, 1]$ shown in Figure 7. The Gaussian kernel (top left panel) behaves
similarly but produces level lines more consistent with the data distribution,
which may be preferable in practice. Finally, we observe that the addition
of small noise (right top and bottom panels) does not significantly change
the eigenfunctions.

APPENDIX B 

Proof of Theorem 2. For a semi-positive definite kernel $K(x, y)$ with
full support on $\mathbb{R}^d$, we first show the top eigenfunction $\phi_0$ of
$\mathcal{K}_P$ has no sign change on the support of the distribution. We
define $R^+ = \{x \in \mathbb{R}^d : \phi_0(x) > 0\}$,
$R^- = \{x \in \mathbb{R}^d : \phi_0(x) < 0\}$ and
$\bar{\phi}_0(x) = |\phi_0(x)|$. It is clear that
$\int \bar{\phi}_0^2\, dP = \int \phi_0^2\, dP = 1$.

Assuming that $P(R^+) > 0$ and $P(R^-) > 0$, we will show that

$$\int\!\!\int K(x, y)\, \bar{\phi}_0(x) \bar{\phi}_0(y)\, dP(x)\, dP(y) > \int\!\!\int K(x, y)\, \phi_0(x) \phi_0(y)\, dP(x)\, dP(y),$$

which contradicts the assumption that $\phi_0(\cdot)$ is the eigenfunction
associated with the largest eigenvalue. Denoting
$g(x, y) = K(x, y) \phi_0(x) \phi_0(y)$ and
$\bar{g}(x, y) = K(x, y) \bar{\phi}_0(x) \bar{\phi}_0(y)$, we have

$$\int_{R^+}\!\int_{R^+} \bar{g}(x, y)\, dP(x)\, dP(y) = \int_{R^+}\!\int_{R^+} g(x, y)\, dP(x)\, dP(y),$$

and the equation also holds on region $R^- \times R^-$. However, over the
region $\{(x, y) : x \in R^+ \text{ and } y \in R^-\}$, we have

$$\int_{R^+}\!\int_{R^-} \bar{g}(x, y)\, dP(x)\, dP(y) > \int_{R^+}\!\int_{R^-} g(x, y)\, dP(x)\, dP(y),$$

since $K(x, y) > 0$, $\phi_0(x) > 0$ and $\phi_0(y) < 0$. The inequality also
holds on $\{(x, y) : x \in R^- \text{ and } y \in R^+\}$. Putting the four
integration regions together, we arrive at the contradiction. Therefore, the
assumptions $P(R^+) > 0$ and $P(R^-) > 0$ cannot be true at the same time,
which implies that $\phi_0(\cdot)$ has no sign changes on the support of the
distribution.

Now consider any $x \in \mathbb{R}^d$. We have

$$\lambda_0 \phi_0(x) = \int K(x, y)\, \phi_0(y)\, dP(y).$$

Given the facts that $\lambda_0 > 0$, $K(x, y) > 0$, and $\phi_0(y)$ has the
same sign on the support, it is straightforward to see that $\phi_0(x)$ has no
sign changes and has full support in $\mathbb{R}^d$. Finally, the isolation of
$(\lambda_0, \phi_0)$ follows. If there exists another $\phi$ that shares the
same eigenvalue $\lambda_0$ with $\phi_0$, they both have no sign change and
have full support on $\mathbb{R}^d$. Therefore,
$\int \phi_0(x) \phi(x)\, dP(x) > 0$, which contradicts the orthogonality
between eigenfunctions. □

Fig. 8. Contours of the top eigenfunction of $\mathcal{K}_P$ for Gaussian (upper panels) and
exponential kernels (lower panels) with bandwidth 0.7. The curve is 3/4 of a ring with radius
3 and independent noise of standard deviation 0.15 added in the right panels.

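Proof of Theorem 3. By definition, the top eigenvalue of $\mathcal{K}_P$ satisfies

$$\lambda_0 = \max_f \frac{\int\!\!\int K(x, y) f(x) f(y)\, dP(x)\, dP(y)}{\int [f(x)]^2\, dP(x)}.$$

For any function $f$,

$$\begin{aligned}
\int\!\!\int K(x, y) f(x) f(y)\, dP(x)\, dP(y)
&= [\pi^1]^2 \int\!\!\int K(x, y) f(x) f(y)\, dP^1(x)\, dP^1(y) \\
&\quad + [\pi^2]^2 \int\!\!\int K(x, y) f(x) f(y)\, dP^2(x)\, dP^2(y) \\
&\quad + 2\pi^1\pi^2 \int\!\!\int K(x, y) f(x) f(y)\, dP^1(x)\, dP^2(y) \\
&\le [\pi^1]^2 \lambda_0^1 \int [f(x)]^2\, dP^1(x) + [\pi^2]^2 \lambda_0^2 \int [f(x)]^2\, dP^2(x) \\
&\quad + 2\pi^1\pi^2 \int\!\!\int K(x, y) f(x) f(y)\, dP^1(x)\, dP^2(y).
\end{aligned}$$

Now we concentrate on the last term,

$$\begin{aligned}
2\pi^1\pi^2 &\int\!\!\int K(x, y) f(x) f(y)\, dP^1(x)\, dP^2(y) \\
&\le 2\pi^1\pi^2 \sqrt{\int\!\!\int [K(x, y)]^2\, dP^1(x)\, dP^2(y)}\; \sqrt{\int\!\!\int [f(x)]^2 [f(y)]^2\, dP^1(x)\, dP^2(y)} \\
&= 2 \sqrt{\pi^1\pi^2 \int\!\!\int [K(x, y)]^2\, dP^1(x)\, dP^2(y)}\; \sqrt{\pi^1 \int [f(x)]^2\, dP^1(x) \times \pi^2 \int [f(y)]^2\, dP^2(y)} \\
&\le \sqrt{\pi^1\pi^2 \int\!\!\int [K(x, y)]^2\, dP^1(x)\, dP^2(y)} \left(\pi^1 \int [f(x)]^2\, dP^1(x) + \pi^2 \int [f(x)]^2\, dP^2(x)\right) \\
&= r \int [f(x)]^2\, dP(x),
\end{aligned}$$

where $r = (\pi^1\pi^2 \int\!\!\int [K(x, y)]^2\, dP^1(x)\, dP^2(y))^{1/2}$. Thus,

$$\begin{aligned}
\lambda_0 &= \max_{f : \int f^2 dP = 1} \int\!\!\int K(x, y) f(x) f(y)\, dP(x)\, dP(y) \\
&\le \max_{f : \int f^2 dP = 1} \left[\pi^1 \lambda_0^1 \int [f(x)]^2 \pi^1\, dP^1(x) + \pi^2 \lambda_0^2 \int [f(x)]^2 \pi^2\, dP^2(x) + r\right] \\
&\le \max(\pi^1 \lambda_0^1, \pi^2 \lambda_0^2) + r.
\end{aligned}$$

The other side of the inequality is easier to prove. Assuming
$\pi^1 \lambda_0^1 > \pi^2 \lambda_0^2$ and taking the top eigenfunction
$\phi_0^1$ of $\mathcal{K}_{P^1}$ as $f$, we derive the following results by
using the same decomposition on
$\int\!\!\int K(x, y) \phi_0^1(x) \phi_0^1(y)\, dP(x)\, dP(y)$ and the facts that
$\int K(x, y) \phi_0^1(x)\, dP^1(x) = \lambda_0^1 \phi_0^1(y)$ and
$\int [\phi_0^1]^2\, dP^1 = 1$. Denoting
$h(x, y) = K(x, y) \phi_0^1(x) \phi_0^1(y)$, we have

$$\begin{aligned}
\lambda_0 &\ge \frac{\int\!\!\int K(x, y) \phi_0^1(x) \phi_0^1(y)\, dP(x)\, dP(y)}{\int [\phi_0^1(x)]^2\, dP(x)} \\
&= \frac{[\pi^1]^2 \lambda_0^1 + [\pi^2]^2 \int\!\!\int h(x, y)\, dP^2(x)\, dP^2(y) + 2\pi^1\pi^2 \lambda_0^1 \int [\phi_0^1(x)]^2\, dP^2(x)}{\pi^1 + \pi^2 \int [\phi_0^1(x)]^2\, dP^2(x)} \\
&= \pi^1 \lambda_0^1 \left(\frac{\pi^1 + 2\pi^2 \int [\phi_0^1(x)]^2\, dP^2(x)}{\pi^1 + \pi^2 \int [\phi_0^1(x)]^2\, dP^2(x)}\right) + \frac{[\pi^2]^2 \int\!\!\int h(x, y)\, dP^2(x)\, dP^2(y)}{\pi^1 + \pi^2 \int [\phi_0^1(x)]^2\, dP^2(x)} \\
&\ge \pi^1 \lambda_0^1.
\end{aligned}$$

This completes the proof. □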